intro.md

Welcome!

Welcome to “How to Create Small Multiples in R with ggplot2!” In this scenario, we will cover how to arrange numerous figures together in a grid or facet.

This scenario assumes you’ve done some data wrangling with tidyr and dplyr, and data visualization with ggplot2.

Knowing your Data

It’s best to start a project off with a “view of the forest from outside the trees.” The technical term for this is data lineage, which:

“Includes the data origin, what happens to it, and where it moves over time.”

Having a “bird’s eye view” of the data helps us catch problems with exporting or importing. Data lineage also means understanding where the data is coming from (e.g., a relational database, API, flat .csv files, etc.).

Tabular Data

Knowing some of the technical details behind a dataset lets us frame the questions or problems we’re trying to tackle. In this scenario, we will use tabular data (i.e., spreadsheets). Tabular data organizes information into columns and rows.

Let’s load some data and get started!

step1.md

Initiate R

Launch an R console by clicking here: R

Load packages

The package we’ll use to view the entire dataset with R is skimr. We will install and load the following packages:

install.packages(c("tidyverse", "skimr"))
library(tidyverse)
library(skimr)

step2.md

Multiple Graph Displays

Sometimes data is so complex and layered that it requires multiple graphs. In this case, we will want to arrange numerous figures together in a grid or facet. Charts presented this way are called “small multiples,” a term popularized by Edward Tufte:

“Small multiples resemble the frames of a movie: a series of graphics, showing the same combination of variables, indexed by changes in another variable.” Edward Tufte, The Visual Display of Quantitative Information.

We are going to demonstrate faceting with a dataset we’ve created from the Internet Movie Database (IMDB).

IMDB Data

IMDB makes multiple datasets available for download. We’ve combined the title.ratings.tsv, name.basics.tsv, and title.principals.tsv datasets into the ImdbData dataset with the following columns:

  • tconst = alphanumeric unique identifier of the title (used for joining)
  • nconst = alphanumeric unique identifier of the name/person (used for joining)
  • category = the category of job that person was in
  • primaryName = name by which the person is most often credited
  • birthYear = in YYYY format
  • averageRating = weighted average of all the individual user ratings
  • numVotes = number of votes the title has received
  • primaryTitle = the more popular title/the title used by the filmmakers on promotional materials at the point of release
  • originalTitle = original title, in the original language
  • isAdult = non-adult title or adult title
  • startYear = represents the release year of a title. In the case of TV Series, it is the series start year.
  • runtimeMinutes = primary runtime of the title, in minutes
  • genres = includes up to three genres associated with the title.
  • age_lead = age of actor/actress at time of film (age_lead = startYear - birthYear)
# click to execute code
ImdbData <- readr::read_csv("https://bit.ly/2O2ZKDC")
glimpse(ImdbData)
#> Rows: 136,925
#> Columns: 14
#> $ tconst         <chr> "tt0000574", "tt0000630", "tt0000886", "tt0001101", "…
#> $ nconst         <chr> "nm0846887", "nm0624446", "nm0609814", "nm0923594", "…
#> $ category       <chr> "actress", "actress", "actor", "actor", "actor", "act…
#> $ primaryName    <chr> "Elizabeth Tait", "Fernanda Negri Pouget", "Jean Moun…
#> $ birthYear      <dbl> 1879, 1889, 1841, 1870, 1869, 1844, 1891, 1879, 1867,…
#> $ averageRating  <dbl> 6.1, 3.2, 5.0, 5.2, 4.7, 5.8, 5.5, 3.9, 4.6, 5.3, 4.2…
#> $ numVotes       <dbl> 609, 11, 23, 13, 10, 22, 47, 11, 13, 28, 14, 13, 25, …
#> $ primaryTitle   <chr> "The Story of the Kelly Gang", "Hamlet", "Hamlet, Pri…
#> $ originalTitle  <chr> "The Story of the Kelly Gang", "Amleto", "Hamlet", "A…
#> $ isAdult        <chr> "non-adult title", "non-adult title", "non-adult titl…
#> $ startYear      <dbl> 1906, 1908, 1910, 1910, 1910, 1912, 1910, 1911, 1910,…
#> $ runtimeMinutes <dbl> 70, NA, NA, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, N…
#> $ genres         <chr> "Biography,Crime,Drama", "Drama", "Drama", "\\N", "Cr…
#> $ age_lead       <dbl> 27, 19, 69, 40, 41, 68, 19, 32, 43, 28, 52, 39, 23, 4…

Next, we create a skimr::skim() summary of the ImdbData dataset:

# click to execute code
SkimImdbData <- skimr::skim(ImdbData)
summary(SkimImdbData)
Data summary
Name ImdbData
Number of rows 136925
Number of columns 14
_______________________
Column type frequency:
character 8
numeric 6
________________________
Group variables None

First, we will review the character variables.

# click to execute code
SkimImdbData %>%
  skimr::yank("character")

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
tconst 0 1 9 10 0 136925 0
nconst 0 1 9 10 0 41765 0
category 0 1 5 7 0 2 0
primaryName 0 1 2 38 0 41658 0
primaryTitle 0 1 1 196 0 124228 0
originalTitle 0 1 1 196 0 128171 0
isAdult 0 1 11 15 0 2 0
genres 0 1 2 31 0 1056 0

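These all look complete (n_missing = 0 and complete_rate = 1 for every variable). Keep in mind that the glimpse() output above shows the placeholder string "\\N" in genres, which skim() counts as a regular value rather than a missing one.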

Do These Numbers Make Sense?

The number of distinct values (n_unique) for each character variable is a good source for sanity checks. For example, the largest number of unique values belongs to the title id variable (tconst = 136925), and this is identical to the number of rows in the dataset. The next largest number belongs to the originalTitle variable (128171), and the documentation tells us this variable is the title for the film in its original language. By itself, this number doesn’t tell us much, but we can see the next largest number (124228) is the film’s primaryTitle, and it makes sense that the number of unique values for these two variables is almost the same.

It also makes sense that n_unique for the person identifier (nconst, 41765) is close to n_unique for primaryName (41658); a small gap like this is what we would expect if a few different people share the same name. There should be far more titles (originalTitle or primaryTitle) than genres, and there are: genres has only 1056 distinct values.

Finally, we can see the two binary variables we read about above (category and isAdult) only list 2 unique values (in n_unique), so it appears we imported these variables correctly.
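
These uniqueness checks can also be scripted. The sketch below uses base R only and a small made-up data frame (the toy_titles name and its values are illustrative assumptions, not rows from ImdbData). It counts distinct values per column, similar in spirit to n_unique, and confirms that the identifier column has one distinct value per row:

```r
# Toy data standing in for ImdbData (values are invented for illustration)
toy_titles <- data.frame(
  tconst   = c("tt001", "tt002", "tt003", "tt004"),
  category = c("actor", "actress", "actor", "actress"),
  genres   = c("Drama", "Drama", "Comedy", "Drama"),
  stringsAsFactors = FALSE
)

# Count distinct values in each column
n_unique <- vapply(toy_titles, function(col) length(unique(col)), integer(1))
n_unique

# An identifier column should have exactly one distinct value per row
id_is_unique <- n_unique[["tconst"]] == nrow(toy_titles)
id_is_unique
```

The same pattern applied to ImdbData would compare the count for tconst against nrow(ImdbData).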

Next, we will review the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in ImdbData:

# click to execute code
SkimImdbData %>%
  skimr::focus(numeric.mean, numeric.sd,
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist) %>%
  skimr::yank("numeric")

Variable type: numeric

skim_variable mean sd p0 p50 p100 hist
birthYear 1946.85 27.79 1839 1950.0 2015 ▁▂▆▇▂
averageRating 6.00 1.16 1 6.1 10 ▁▂▇▆▁
numVotes 6015.29 43514.99 10 125.0 2334927 ▇▁▁▁▁
startYear 1985.28 26.46 1906 1990.0 2021 ▁▂▃▅▇
runtimeMinutes 97.14 24.13 2 94.0 1500 ▇▁▁▁▁
age_lead 38.43 12.95 1 36.0 98 ▁▇▅▁▁

Do These Numbers Make Sense?

Let’s take a look:

  • The average birthYear is 1947, which is plausible considering the date range for movies in the IMDB (1906 - 2021).

  • The average movie rating is 6.00 on IMDB’s 1 to 10 scale. The mean and median (p50 = 6.1) are relatively close to each other, which suggests the distribution isn’t heavily skewed.

  • The number of votes (numVotes) is the most skewed variable: it ranges from 10 to 2334927, and the mean (6015.29) is far above the median (125).

  • The startYear for the movie has an average of 1985, and increases steadily from 1906 to 2021, making sense because more films are being made every year.

  • The average length of each movie in ImdbData is 97.1 minutes (runtimeMinutes). But we can also see from the hist that the range for runtimeMinutes includes some very low and high values (p0 = 2 and p100 = 1500).

  • The actor/actress’s average age (age_lead) is 38.4, with a low of 1 and a high of 98 (both plausible).
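
One rough way to put a number on the mean versus median comparison is to divide one by the other. The sketch below uses only the summary statistics printed in the skim() output; the ratio is an informal rule of thumb chosen for illustration, not a formal skewness test:

```r
# Summary statistics copied from the skim() output above
mean_votes  <- 6015.29
median_votes <- 125
mean_rating <- 6.00
median_rating <- 6.1

# A ratio far from 1 suggests a long tail pulling the mean away from the median
votes_ratio  <- mean_votes / median_votes
rating_ratio <- mean_rating / median_rating

round(c(numVotes = votes_ratio, averageRating = rating_ratio), 2)
```

A ratio near 1 (averageRating) is consistent with a roughly symmetrical distribution, while a ratio far above 1 (numVotes) points to a long right tail.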

step3.md

Creating Year Categories

We will proceed under the assumption that our stakeholders asked us to help explain the relationship between the average rating a movie received (averageRating) and the number of votes that went into the score (numVotes).

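There are quite a few years in this dataset, so instead of treating each year separately, we will group the titles into decades. To do this, we need a categorical variable built from the startYear variable. The cut() function is handy because we can supply the number of intervals we want to split the numeric startYear variable into (12 in this case). Note that breaks = 12 creates 12 equal-width intervals across the observed range (1906 to 2021, so a little under 10 years each), which means the decade labels are approximate. We will also create some clear labels for this categorical variable with the labels argument and make sure the resulting factor is ordered.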

We check our new factor variable with the fct_count() from the forcats package:

# click to execute code
ImdbData <- ImdbData %>%
  mutate(year_cat10 = cut(x = startYear,
                          breaks = 12,
                          labels = c("1910s", "1920s", "1930s",
                                     "1940s", "1950s", "1960s",
                                     "1970s", "1980s", "1990s",
                                     "2000s",  "2010s", "2020s"),
                          ordered_result = TRUE))
# check the count of our factor levels
fct_count(f = ImdbData$year_cat10, sort = TRUE)
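
When cut() is given a single number for breaks, it builds equal-width bins over the observed range, so the bin edges are not guaranteed to fall on decade boundaries. If exact calendar decades matter, one alternative is to pass explicit break points. The sketch below is an illustration under that assumption, using a handful of made-up release years rather than ImdbData; note that it also produces a "1900s" level because the break points start at 1900:

```r
# A few release years near decade boundaries (illustrative values)
start_year <- c(1906, 1919, 1920, 1994, 2008, 2021)

# Explicit break points: 1900, 1910, ..., 2030
decade_breaks <- seq(1900, 2030, by = 10)

# One label per interval: "1900s", "1910s", ..., "2020s"
decade_labels <- paste0(head(decade_breaks, -1), "s")

# right = FALSE closes each interval on the left, e.g. [1990, 2000)
decade <- cut(start_year,
              breaks = decade_breaks,
              labels = decade_labels,
              right = FALSE,
              ordered_result = TRUE)

data.frame(start_year, decade)
```

With right = FALSE, a year such as 1920 is assigned to the 1920s rather than the 1910s.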

Number of Votes

We want to examine how the numVotes variable changed over time (year_cat10). Let’s review the numVotes variable below with skimr::skim():

ImdbData$numVotes %>% skimr::skim()
Data summary
Name Piped data
Number of rows 136925
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 6015.29 43514.99 10 33 125 666 2334927 ▇▁▁▁▁

We can see the values for this variable are concentrated at the low end (right-skewed): the median is 125, but the maximum is over 2.3 million.

Average Individual User Ratings

The averageRating gives us the weighted average of all the individual user ratings. We will refresh our memory about this variable by taking a quick look at the skimr::skim() output below:

ImdbData$averageRating %>%
  skimr::skim()
Data summary
Name Piped data
Number of rows 136925
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 6 1.16 1 5.3 6.1 6.8 10 ▁▂▇▆▁

The averageRating values look evenly distributed around the mean and median, and the hist looks relatively symmetrical. We will see if this holds for averageRating when we look at it across the levels of year_cat10.

For the graphs, we want numVotes on the x axis and averageRating on the y axis, with a separate panel for each decade.

step4.md

Summary of Votes by Year Categories

Below is the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numVotes, grouped by year_cat10:

# click to execute code
ImdbData %>%
  group_by(year_cat10) %>%
  select(year_cat10, numVotes) %>%
  skimr::skim() %>%
  skimr::focus(numeric.mean, numeric.sd,
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist)
Data summary
Name Piped data
Number of rows 136925
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables year_cat10

Variable type: numeric

skim_variable mean sd p0 p50 p100 hist
numVotes 223.35 1327.77 10 22 22671 ▇▁▁▁▁
numVotes 579.97 5005.06 10 27 112785 ▇▁▁▁▁
numVotes 686.31 5486.03 10 56 167259 ▇▁▁▁▁
numVotes 981.25 11114.29 10 60 520667 ▇▁▁▁▁
numVotes 1174.25 9046.01 10 78 404337 ▇▁▁▁▁
numVotes 1582.93 14908.81 10 68 687259 ▇▁▁▁▁
numVotes 1635.41 20128.38 10 69 1614179 ▇▁▁▁▁
numVotes 2717.53 30294.50 10 69 1228225 ▇▁▁▁▁
numVotes 4471.46 33454.25 10 85 1266076 ▇▁▁▁▁
numVotes 9814.88 65715.80 10 183 2334927 ▇▁▁▁▁
numVotes 11634.93 60739.55 10 273 2295898 ▇▁▁▁▁
numVotes 9034.78 49947.97 10 282 1512248 ▇▁▁▁▁

Summary of Average Individual User Ratings by Year Categories

Below is the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the averageRating, grouped by year_cat10:

# click to execute code
ImdbData %>%
  group_by(year_cat10) %>%
  select(year_cat10, averageRating) %>%
  skimr::skim() %>%
  skimr::focus(numeric.mean, numeric.sd,
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist)
Data summary
Name Piped data
Number of rows 136925
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables year_cat10

Variable type: numeric

skim_variable mean sd p0 p50 p100 hist
averageRating 5.75 1.08 2.0 5.9 8.2 ▁▁▆▇▂
averageRating 6.14 1.07 1.0 6.4 9.4 ▁▁▃▇▁
averageRating 6.18 0.91 1.7 6.3 9.0 ▁▁▅▇▁
averageRating 6.19 0.84 1.1 6.3 8.7 ▁▁▂▇▁
averageRating 6.31 0.87 1.9 6.4 9.0 ▁▁▅▇▁
averageRating 6.23 1.00 1.8 6.3 9.2 ▁▁▆▇▁
averageRating 6.11 1.09 1.4 6.2 9.3 ▁▁▆▇▁
averageRating 6.06 1.13 1.1 6.2 9.3 ▁▁▆▇▁
averageRating 6.00 1.17 1.0 6.1 9.5 ▁▂▇▇▁
averageRating 5.90 1.20 1.1 6.1 9.3 ▁▂▇▇▁
averageRating 5.87 1.25 1.0 6.1 9.6 ▁▂▇▇▁
averageRating 5.83 1.28 1.0 6.0 10.0 ▁▂▇▅▁

How do you think the relationship between numVotes and averageRating will look?

Faceting for Multiple Comparisons

We want to make direct comparisons across time by placing an individual plot for each decade within the same view. The best way to accomplish this is by using small multiples, where we repeat the same graph for each snapshot of time and present them in a grid.

The functions for creating small multiples in ggplot2 are facet_wrap() or facet_grid(). We will demonstrate this below using the former, and we also make some adjustments to the x axis to make it easier to read:

Inside the ggplot2::scale_x_continuous() function:

  • Set limits to 10 (the minimum) and 2334927 (the maximum)
  • The breaks argument specifies how we want to ‘break up’ our x axis. We will use the minimum, 1167464 (which is 0.5 x the maximum), and the maximum
  • labels specifies what text we want to display at each break
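
As a small illustration of how these three arguments fit together, the break points can be derived from the limits instead of typed by hand. This is a sketch using the minimum and maximum reported by skim(); the variable names are made up for the example:

```r
# Minimum and maximum of numVotes, taken from the skim() output
vote_limits <- c(10, 2334927)

# Three breaks: the minimum, half of the maximum (rounded up), and the maximum
vote_breaks <- c(vote_limits[1], ceiling(vote_limits[2] / 2), vote_limits[2])
vote_breaks

# Labels must line up one-to-one with the breaks
vote_labels <- c("10", "~1.2M", "~2.3M")
length(vote_labels) == length(vote_breaks)
```

Keeping breaks and labels the same length matters because scale_x_continuous() raises an error when they differ.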

The ggplot2::facet_wrap() function uses a ~ followed by the categorical variable (year_cat10) we’ve specified in the ggplot(aes(group = )) argument.

Now we can build our labels and small multiples graph:

# click to execute code

#   build labels
labs_avgusr_nmvote_yearcat10 <- labs(
  title = "Number of votes vs average individual user ratings over time",
  subtitle = "Internet Movie Database (IMDB)",
  caption = "https://www.imdb.com",
  y = "Average individual user ratings",
  x = "Number of votes")


gg_step4_facet_01 <- ImdbData %>%
  ggplot(aes(x = numVotes,
             y = averageRating,
             group = year_cat10)) +
  geom_point(size = 0.5,
             alpha = 1/10,
             show.legend = FALSE) +
  # add x scale attributes
  scale_x_continuous(limits = c(10, 2334927),
                     breaks = c(10, 1167464, 2334927),
                     labels = c('10', '~1.2M', '~2.3M')) +
  # add facet
  facet_wrap(~ year_cat10) +
  labs_avgusr_nmvote_yearcat10

# save
# ggsave(plot = gg_step4_facet_01,
#         filename = "gg-step4-facet-01.png",
#         device = "png",
#         width = 9,
#         height = 6,
#         units = "in")
gg_step4_facet_01

The ggsave() call above is commented out. Uncomment it to write gg-step4-facet-01.png, then open the file in the VS Code IDE above the Terminal console to view the graph.

We can see how small multiples allow for more incisive comparisons. The relationship between votes and average user rating becomes slightly more evident as time progresses. A few data points approach the 8.75 line in the 1960s and 1970s panels, and there appears to be a slight upward trend for average user rating in the 1990s through 2020s panels.
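
To compare the two faceting functions mentioned earlier, here is a minimal sketch on made-up data (the toy data frame and its group column are assumptions for illustration, not part of ImdbData). facet_wrap() wraps one panel per level into rows and columns, while facet_grid() lays panels out strictly along a row and/or column variable:

```r
library(ggplot2)

# Made-up data: 3 groups with 20 points each
set.seed(1)
toy <- data.frame(
  x     = runif(60, min = 1, max = 100),
  y     = rnorm(60, mean = 6, sd = 1),
  group = rep(c("A", "B", "C"), each = 20)
)

base_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(alpha = 1/2)

# facet_wrap(): one panel per level, wrapped into a 2-column layout
p_wrap <- base_plot + facet_wrap(~ group, ncol = 2)

# facet_grid(): one column of the grid per level of group
p_grid <- base_plot + facet_grid(cols = vars(group))

p_wrap
p_grid
```

For a single categorical variable with many levels, such as year_cat10, facet_wrap() usually makes better use of the available space.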

step5.md

Transforming Axes

We will use small multiples again, but this time we will view a log10 transformed numVotes variable. Transforming an axis can sometimes make the display easier to see, but we also need to find a way to explain this change to our audience.
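
To get a feel for what the transformation does, the sketch below applies log10() to the numVotes quantiles reported by skim() earlier. Only the summary values are used here, not the full dataset:

```r
# Quantiles of numVotes reported by skim(): p0, p25, p50, p75, p100
votes <- c(p0 = 10, p25 = 33, p50 = 125, p75 = 666, p100 = 2334927)

# Raw scale: the maximum is several orders of magnitude above the median
votes

# log10 scale: each unit step is a tenfold increase in votes
log_votes <- log10(votes)
round(log_votes, 2)
```

On the raw scale nearly all of the axis is taken up by the gap between p75 and p100; on the log10 scale the same five values fall between 1 and roughly 6.4, so the bulk of the data is no longer squeezed against the left edge.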

We will need to update our labels, add the scale_x_log10() layer, and use facet_wrap() with year_cat10. We will also use the handy label_log10 function developed by Claus Wilke:

# click to execute code

# build labels
labs_avgusr_lognmvote_yearcat10 <- labs(
  title = "*Number of votes vs average individual user ratings over time",
  subtitle = "Internet Movie Database (https://www.imdb.com/)",
  caption = "*Number of votes has been log10 transformed",
  y = "Average individual user ratings",
  x = "log10(Number of votes)")

# load label_log10 function
source("https://bit.ly/35Ywt2q")

# build plot
gg_step5_facet_02 <- ImdbData %>%
  ggplot(aes(x = numVotes,
             y = averageRating,
             group = year_cat10)) +
  geom_point(size = 0.5,
             alpha = 1/10,
             show.legend = FALSE) +
  # add log10 x axis
  scale_x_log10(labels = label_log10) +
  # facet wrap on decades
  facet_wrap(~ year_cat10) +
  # add labels
  labs_avgusr_lognmvote_yearcat10

# save
# ggsave(plot = gg_step5_facet_02,
#         filename = "gg-step5-facet-02.png",
#         device = "png",
#         width = 9,
#         height = 6,
#         units = "in")
gg_step5_facet_02

The ggsave() call above is commented out. Uncomment it to write gg-step5-facet-02.png, then open the file in the VS Code IDE above the Terminal console to view the graph.

Now the small multiples show the relationship between the log10 transformed number of votes and average user rating. The majority of the data points appear concentrated around the same value for average user rating (~6.125).

Adding A Best Fit Line

We can use ggplot2’s geom_smooth() function to draw a smoothed “best-fit line” through the data points in each decade. Modeling is beyond the scope of this scenario, but feel free to read more about building models in this chapter of R for Data Science.

For now, know the following:

  • method = 'lm' tells geom_smooth to fit the best straight line through the data points
  • size = 0.75 is the thickness of the line (ggplot2 3.4.0 and later prefer the name linewidth for this argument)
  • color = "firebrick2" will make the line color red
  • fullrange = TRUE will draw the line through the entire range of the graph
# click to execute code

#   build labels
labs_avgusr_lognmvote_yearcat10_lm <- labs(
  title = "*Number of votes vs average individual user ratings over time",
  subtitle = "Internet Movie Database (https://www.imdb.com/)",
  caption = "*Number of votes has been log10 transformed; lm smooth line",
  y = "Average individual user ratings",
  x = "log10(Number of votes)")

gg_step5_facet_03 <- ImdbData %>%
  ggplot(aes(x = numVotes,
             y = averageRating,
             group = year_cat10)) +
  geom_point(size = 0.5,
             alpha = 1/10,
             show.legend = FALSE) +
  # add smoothed line
  geom_smooth(method = 'lm',
              size = 0.75,
              color = "firebrick2",
              fullrange = TRUE) +
  # add x axis attributes
  scale_x_log10(labels = label_log10) +
  # add facets
  facet_wrap(~ year_cat10) +
  # add labels
  labs_avgusr_lognmvote_yearcat10_lm

# save
# ggsave(plot = gg_step5_facet_03,
#        filename = "gg-step5-facet-03.png",
#         device = "png",
#         width = 9,
#         height = 6,
#         units = "in")
gg_step5_facet_03

The ggsave() call above is commented out. Uncomment it to write gg-step5-facet-03.png, then open the file in the VS Code IDE above the Terminal console to view the graph.

We can see the relationship between the log10 transformed number of votes and average user rating is slightly positive but has become less pronounced over time. The gray band is the confidence interval (95% by default) around the red line we’ve drawn; it is wider where there are fewer data points.
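
Because ggplot2 applies scale transformations before computing statistics, the line in each panel is fit to averageRating against log10(numVotes). The sketch below reproduces that idea with lm() on a few made-up points (the votes and rating values are invented for illustration), and uses predict() to compute a 95% confidence band like the one geom_smooth() draws:

```r
# Made-up data: rating rises with log10(votes), plus a little noise
votes  <- c(10, 100, 1000, 10000, 100000)
rating <- 5 + 0.5 * log10(votes) + c(0.1, -0.1, 0, 0.1, -0.1)

# Fit a straight line on the log10-transformed predictor
fit <- lm(rating ~ log10(votes))
slope <- coef(fit)[["log10(votes)"]]
slope

# 95% confidence band for the fitted line at two new vote counts
band <- predict(fit,
                newdata = data.frame(votes = c(50, 5000)),
                interval = "confidence",
                level = 0.95)
band
```

Here the slope is the change in rating for each tenfold increase in votes, which is how the slope of each red line in the faceted plot can be read.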

finish.md

How to Create Small Multiples in R with ggplot2

In this scenario we covered how to:

  1. skim() variables to get summary statistics
  2. Create categorical variables using mutate() and cut()
  3. Build small multiple graphs with ggplot2::facet_wrap()
  4. Transform axes with ggplot2::scale_x_log10() and label_log10()
  5. Save graph images with ggplot2::ggsave()

Thank You!

We’ve concluded the “How to Create Small Multiples in R with ggplot2” scenario! Thank you for completing this scenario, and be sure to check out the other scenarios on the O’Reilly Learning Platform.